Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] SBT feature: combine trees #143

Merged
merged 12 commits into from
Apr 10, 2017
Merged

[MRG] SBT feature: combine trees #143

merged 12 commits into from
Apr 10, 2017

Conversation

luizirber
Copy link
Member

@luizirber luizirber commented Mar 8, 2017

Fixes #141

The initial implementation just went thru the leaves in another tree and add to the original one. This is not so efficient because we recreate internal nodes (a lot).

A better solution is to create a new root, make each original tree a child of the new root, and reuse all the internal nodes already calculated. This is implemented now.

Possible improvements:

  1. The current solution (create a new root) works well with binary trees, but if d>2 we could instead check to see if there is a position available near the root before inserting. This is more visible if you try to combine trees in sequence.
  2. Since I'm not reorganizing the tree properly, the next available position will be between the combine trees (there might be an empty position on the left tree before we inserted on the right tree, for example). Need to do some more bookkeeping for the next available position.
  3. Trees are not balanced anymore (there might be leaves in places other than the last level).

A more powerful solution would be to do a depth-first search in the original trees and replicate what is found in the new one, this would also support better case 1.

For 2 and 3, we only need that because the current add_node implementation expect the tree to be filled in order, without gaps. Other methods for adding nodes might not need it, tho (I'm thinking about doing a guided insertion, putting signatures closer to places where they have higher similarity, for example).

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@luizirber luizirber mentioned this pull request Mar 8, 2017
@codecov-io
Copy link

codecov-io commented Mar 10, 2017

Codecov Report

Merging #143 into master will increase coverage by 0.14%.
The diff coverage is 83.11%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #143      +/-   ##
==========================================
+ Coverage   85.37%   85.52%   +0.14%     
==========================================
  Files          13       13              
  Lines        1833     1879      +46     
  Branches       52       52              
==========================================
+ Hits         1565     1607      +42     
- Misses        259      263       +4     
  Partials        9        9
Impacted Files Coverage Δ
sourmash_lib/__main__.py 100% <ø> (ø) ⬆️
sourmash_lib/commands.py 89.44% <100%> (+0.3%) ⬆️
sourmash_lib/sbt.py 83.03% <77.96%> (+0.3%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9ab2433...f16930a. Read the comment docs.

@luizirber
Copy link
Member Author

luizirber commented Mar 10, 2017

@viehwegerlib wanna take this branch for a field test with your SBTs? =]

If your trees are viral.sbt.json and microbial.sbt.json,
and a signature file in sig_file.sig,
this should work:

from sourmash_lib import signature
from sourmash_lib.sbt import SBT
from sourmash_lib.sbtmh import SigLeaf, search_minhashes

sig = next(signature.load_signatures('sig_file.sig'))

tree_viral = SBT.load("viral.sbt.json", leaf_loader=SigLeaf.load)
tree_microbial = SBT.load("microbial.sbt.json", leaf_loader=SigLeaf.load)

results_viral = {str(s) for s in tree_viral.find(search_minhashes, sig, 0.1)}
results_microbial = {str(s) for s in tree_microbial.find(search_minhashes, sig, 0.1)}

new_tree = tree_viral.combine(tree_microbial)
# new_tree and tree_viral are the same thing, from this point on

results = {str(s) for s in new_tree.find(search_minhashes, sig, 0.1)}

assert results == (results_viral | results_microbial)

@phiweger
Copy link

pure awesome, will try right now

@phiweger
Copy link

All works well till I try to write the tree so disk. How is the .save() method used properly?

new_tree.save(tag='/path/to/combined_tree')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-16-f8de71c2da88> in <module>()
----> 1 new_tree.save(tag='/Users/pi/tmp/combined')

/Users/pi/.virtualenvs/lab3/lib/python3.5/site-packages/sourmash_lib/sbt.py in save(self, tag)
    195                 continue
    196
--> 197             basename = os.path.basename(node.name)
    198             data = {
    199                 'filename': os.path.join('.sbt.' + basetag,

AttributeError: 'LazyNode' object has no attribute 'name'

@luizirber
Copy link
Member Author

Oh LazyNode, bane of my existence... #85 will fix that, but for now I'll add the necessary .do_load() call.

@phiweger
Copy link

phiweger commented Mar 10, 2017

It saves, nice. However, when sbt_searching in the combined tree (microbial + virus) for a signature that I indexed in the virus SBT, nothing similar is found, but when sbt_searching the viral SBT alone, the signature retrieves "itself" as the most similar hit.

@luizirber
Copy link
Member Author

luizirber commented Mar 10, 2017

@viehwegerlib hmm. Can you share the trees with me somehow? I tried with viral refseq SBT + bacteria refseq SBT and it worked (virus sig found in both viral and combined tree)

@luizirber
Copy link
Member Author

I also tried to create a chimera (copied 250 hashes from a virus, 250 from a bacteria) and got 3 matches (one viral, two bacterial). in the combined tree.

@phiweger
Copy link

I could send you the files via wetransfer.com, where do you want me to send them?

@phiweger
Copy link

All's well - that's what I get for coding with little coffee. I made a mistake in preparing the test files. The combined tree works perfect. Very cool, thanks a lot!

@luizirber luizirber changed the title [WIP] SBT feature: combine trees [MRG] SBT feature: combine trees Mar 29, 2017
# An internal node, we need to update the name
new_node.name = "internal.{}".format(current_pos)
new_nodes[current_pos] = new_node
if tree.nodes[pos] is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not simple an else?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point! Thanks!

self.nodes[p.pos] = n
c1 = self.children(p.pos)[0]
self.nodes[c1.pos] = node
node.update(n)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to add a test to get some coverage here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw this happen when using bigger SBTs, I'll try to cook up a test without putting too much data in the repo

Copy link
Contributor

@betatim betatim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update the branch, decide on the two comments, otherwise LGTM

@ctb
Copy link
Contributor

ctb commented Apr 9, 2017

This looks ready to merge @luizirber - shall I push the magic button?

@ctb
Copy link
Contributor

ctb commented Apr 9, 2017

Whups, update from master is nontrivial...

@luizirber luizirber merged commit c0e3476 into master Apr 10, 2017
@luizirber luizirber deleted the feature/combine branch April 10, 2017 21:04
@luizirber luizirber added the sbt label May 4, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

combine SBTs
5 participants